This paper presents a scalable Twitter sentiment analysis system using Natural Language Processing (NLP) and Logistic Regression. A dataset of approximately 1.6 million labeled tweets is used for binary sentiment classification. Data preprocessing includes tokenization, stopword removal, and text normalization, followed by TF-IDF vectorization and vector normalization to achieve effective feature representation. Logistic Regression is selected for its computational efficiency and strong performance on high-dimensional data. The system is deployed using Streamlit to enable real-time sentiment prediction. The results show that, although deep learning models may achieve higher accuracy, they are prone to overfitting; in contrast, the proposed approach provides balanced performance with lower computational cost and improved generalization.
Introduction
This paper presents a Twitter sentiment analysis system designed to extract public opinion from large-scale social media data using Natural Language Processing (NLP). The study focuses on handling the complexity of Twitter data, which includes slang, emojis, noise, and informal language, as well as the challenge of processing a large dataset of around 1.6 million tweets.
The proposed solution uses a scalable machine learning approach based on Logistic Regression and TF-IDF feature extraction. The system includes preprocessing steps such as cleaning text, removing noise, and converting emojis into meaningful representations. Logistic Regression is chosen for its balance of accuracy and interpretability compared to more complex deep learning models.
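The preprocessing steps described above can be illustrated with a minimal sketch. It is not the implementation used in this work: the emoji lookup, the stopword set, and the function names `clean_tweet` and `tokenize` are hypothetical and chosen only to show the order of operations (emoji conversion, noise removal, normalization, stopword filtering).

```python
import re

# Hypothetical, deliberately tiny emoji lookup (an assumption for illustration;
# a real system would need a much fuller mapping).
EMOJI_TO_TEXT = {
    "😀": " happy ",
    "😢": " sad ",
    "😡": " angry ",
    "👍": " good ",
}

# Hypothetical minimal stopword set, also an assumption.
STOPWORDS = {"a", "an", "the", "i", "is", "this", "that", "to", "and", "of"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
REPEAT_RE = re.compile(r"(.)\1{2,}")      # three or more repeats of a character
NON_ALPHA_RE = re.compile(r"[^a-z\s]")    # anything except lowercase letters and whitespace


def clean_tweet(text: str) -> str:
    """Normalize a raw tweet into lowercase, space-separated words."""
    # Convert emojis to words first, before non-letter characters are stripped.
    for emoji, word in EMOJI_TO_TEXT.items():
        text = text.replace(emoji, word)
    text = text.lower()
    text = URL_RE.sub(" ", text)          # drop links
    text = MENTION_RE.sub(" ", text)      # drop @user mentions
    text = text.replace("#", " ")         # keep the hashtag word, drop the symbol
    text = REPEAT_RE.sub(r"\1\1", text)   # "loooove" -> "loove"
    text = NON_ALPHA_RE.sub(" ", text)    # drop digits and punctuation
    return " ".join(text.split())         # collapse whitespace


def tokenize(text: str) -> list[str]:
    """Clean a tweet, split it on whitespace, and remove stopwords."""
    return [tok for tok in clean_tweet(text).split() if tok not in STOPWORDS]


if __name__ == "__main__":
    print(tokenize("@bob I loooove this!!! 😀 http://t.co/x #great"))
```

Converting emojis before stripping non-letter characters matters here: if the order were reversed, the emoji would be deleted as noise and its sentiment signal lost.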
The project objectives include building a scalable sentiment analysis pipeline, preprocessing large datasets, extracting features using TF-IDF, training a classification model, and deploying a real-time prediction interface using Streamlit.
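To make the TF-IDF feature extraction objective concrete, the following sketch computes TF-IDF weights with L2 vector normalization in plain Python. The exact weighting formula used in this work is not stated, so the smoothed inverse document frequency `ln((1 + N) / (1 + df)) + 1` is an assumption, and the function name `tfidf_vectors` is hypothetical. At the scale of 1.6 million tweets a sparse-matrix library would be used instead.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Compute L2-normalized TF-IDF vectors.

    docs: list of token lists, one per tweet.
    Returns (idf, vectors), where idf maps term -> weight and vectors is a
    list of sparse dicts mapping term -> normalized TF-IDF value.
    """
    n_docs = len(docs)

    # Document frequency: in how many tweets each term appears.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))

    # Smoothed IDF (assumed variant): rare terms receive larger weights.
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in df.items()}

    vectors = []
    for tokens in docs:
        tf = Counter(tokens)
        weights = {t: count * idf[t] for t, count in tf.items()}
        # L2 normalization so tweet length does not dominate the features.
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return idf, vectors


if __name__ == "__main__":
    idf, vecs = tfidf_vectors([["good", "day"], ["bad", "day"]])
    print(idf)
    print(vecs[0])
```

In this toy corpus the term "day" appears in every tweet and so receives the minimum IDF, while "good" and "bad" are weighted more heavily, which is the behavior that makes TF-IDF useful for separating sentiment-bearing words from common ones.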
The methodology follows a modular pipeline consisting of data preprocessing, feature extraction, model training, evaluation (using accuracy, precision, recall, and F1-score), and deployment. The system is implemented in Python using a Jupyter Notebook for development and a Streamlit web application for real-time sentiment prediction.
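The four evaluation metrics named above follow directly from the confusion-matrix counts. The sketch below shows the standard definitions for the binary case; the function name `binary_metrics` and the example labels are illustrative assumptions, not results from this study.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall, and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    # Guard every ratio against an empty denominator.
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Made-up labels purely for illustration (1 = positive, 0 = negative).
    y_true = [1, 1, 1, 0, 0, 0, 1, 0]
    y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
    print(binary_metrics(y_true, y_pred))
```

Reporting precision, recall, and F1 alongside accuracy guards against a misleadingly high accuracy when one sentiment class is predicted far more often than the other.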
Overall, the study demonstrates an efficient and scalable approach to sentiment classification on Twitter data using traditional machine learning techniques combined with a user-friendly deployment system.
Conclusion
This project presents a Twitter Sentiment Analysis system using NLP, Logistic Regression, and Streamlit. Trained on a Kaggle dataset of 1.6 million labelled tweets, it enables large-scale text classification using TF-IDF features. The system supports real-time sentiment prediction and demonstrates that traditional machine learning with Streamlit integration is effective for deploying practical and scalable NLP applications.
References
[1] Qutab, U. Fatima, and I. Ahmed, “Analyzing COVID-19 Sentiments on Twitter: An Effective Machine Learning Approach,” International Journal of Innovative Science and Research Technology, pp. 841–850, Aug. 2024, doi: 10.38124/ijisrt/ijisrt24aug640.
[2] D. Kavitha, S. Venkatraman, K. CR, and N. S. Nair, “Machine Learning-Based Sentiment Analysis of Twitter Using Logistic Regression,” Advances in Systems Analysis, Software Engineering, and High-Performance Computing book series, pp. 308–319, June 2024, doi: 10.4018/979-8-3693-3502-4.ch020.
[3] E. Vladić, B. Mehanović, M. Novalić, D. Kečo, and D. Mehanović, “Sentiment Classification of Tweets using Machine Learning and NLP Techniques,” in Proceedings of the 3rd International Conference on NLP and Machine Learning Trends (NLMLT 2024), Oct. 2024, pp. 35–45, doi: 10.5121/csit.2024.142004.
[4] P. Dhanalakshmi, G. Ashish Kumar, B. Sai Satwik, K. Sreeranga, A. Tharun Sai, and G. Jashwanth, “Sentiment Analysis Using VADER and Logistic Regression Techniques,” in International Conference on Intelligent Systems for Communication, IoT and Security (ICISCoIS), Coimbatore, India, 2023, pp. 139–144, doi: 10.1109/ICISCoIS56541.2023.10100565.
[5] S. Raheja and A. Asthana, “Sentiment Analysis of Tweets During the COVID-19 Pandemic Using Multinomial Logistic Regression,” International Journal of Software Innovation, vol. 11, no. 1, pp. 1–16, Jan. 2023, doi: 10.4018/ijsi.315740.
[6] E. Cerrahoğlu and P. Cihan, “Sentiment Analysis and Emojification of Tweets,” in International Conference on Pioneer and Innovative Studies (ICPIS), vol. 1, pp. 481–486, June 2023, doi: 10.59287/icpis.876.
[7] A. Muslim, A. Benny, R. Refianti, C. Maisyarah, and G. Setiawan, “Comparison of Accuracy between Long Short-Term Memory-Deep Learning and Multinomial Logistic Regression-Machine Learning in Sentiment Analysis on Twitter,” International Journal of Advanced Computer Science and Applications, vol. 11, no. 2, Jan. 2020, doi: 10.14569/IJACSA.2020.0110294.
[8] R. Lakshmi, S. R. B. Divya, and R. Valarmathi, “Analysis of sentiment in twitter using logistic regression,” International Journal of Engineering and Technology, vol. 7, p. 619, June 2018, doi: 10.14419/IJET.V7I2.33.14849.
[9] A. Gangawane, “Opinion Mining and Sentiment Analysis on Twitter,” International Journal of Computer Applications, vol. 182, no. 10, pp. 32–35, Aug. 2018, doi: 10.5120/IJCA2018917718.
[10] A. Go, R. Bhayani, and L. Huang, “Twitter sentiment classification using distant supervision,” Stanford University, Stanford, CA, USA, Tech. Rep., 2009. [Online]. Available: https://www.kaggle.com/datasets/kazanova/sentiment140